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Abstract 

-^When  solving  optimisation  problems  on  a  parallel  computing  system,  the 
first  consideration  must  be  to  utilise  the  parallelism  to  speed  up  the  95X 
of  the  time  typically  spent  in  function  and  gradient  calculations.  This 
is  normally  done  in  one  of  two  ways  namely  the  parallel  computation  of 
functions  or  gradient  evaluations  or  the  division  of  each  function 
evaluation  into  a  number  of  parallel  tasks. 

Assuming  this  prime  task  has  been  undertaken  effectively  then  for 
efficiency  the  other  5%  of  the  computation  must  also  utilise  the 
parallelism  available  on  the  system. 

The  dominant  remaining  calculation  is  usually  the  solution  of  a  set  of 
linear  equations. 

i 

i  In  this  paper  the  implication  of  parallel  processing  on  the  solution  method 
/  for  solving  linear  equations  will  be  reviewed.  - 
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1.  Introduction 


In  this  report  we  will  be  concerned  with  the  optimisation  problem 
Min  F(x)  x  e  Rn . 

Occasionally  we  will  assume  simple  upper  and  lower  bounds  of  the  form  li  < 
xt  <  ut  exist . 

There  is  of  course  a  complete  theory  for  the  convergence  of  iterative 
algorithms  which  generate  a  sequence  of  estimates 
x,k+1)  -  x(kl  ♦  ap{k) 

and  many  efficient  codes  exist  for  the  solution  of  such  problems  on  a 
sequential  machine.  These  have  successfully  solved  many  optimisation 
problems . 

The  question  therefore  arises  as  to  why  then  we  should  be  interested  in 
introducing  the  parallel  processing  concept  into  numerical  optimisation. 

The  main  reasons  that  influenced  us  were: 

jL 

(1)  that  we  knew  of  industrial  problems  that  took  an  embarassingly  long 
time  on  a  sequential  machine 

and  we  knew  too  that 

(2)  industry  only  poses  problems  that  it  thinks  might  be  soluble. 

By  introducing  the  parallel  processing  concept  into  numerical 

optimisation  we  hoped  to  be  able  to  extend  the  range  of  soluble  problems. 

Ue  identified  four  different  situations  where  we  felt  that  the  solution  of 
optimisation  problems  would  most  benefit  from  the  availability  of  parallel 
processing  machines.  These  were:- 
1)  Small  Dimensional  Expensive  Problems 

These  are  typified  by  industrial  problems  which  frequently  have  a  small 
dimension  n  <  100  but  where  the  time  required  to  compute  the  function  and 
gradient  values  at  x  can  be  considerable  and  where  this  dominates  the 
computation  within  the  algorithm. 


2)  Large  Dimensional  Problems 

There  are  many  large  dimensional  problems  n  >  2000  where  the  combined 
processing  time  and  storage  requirements  cause  difficulties. 

3)  On  Line  Optimisation 

There  are  many  on  line  optimisation  problems,  for  instance  the 
optimisation  of  car  fuel  consumption  which  cannot  be  easily  solved  using 
existing  sequential  optimisation  codes  on  the  type  of  processors  that  could 
be  easily  installed  within  a  car  but  which  might  be  solved  on  more  than  one 
such  processor. 

4)  Multi  Extremal  (Global)  Optimisation 

Problems  in  which  the  objective  function  has  many  local  minima  and 
where  the  real  problem  is  to  identify  the  best  of  these,  still  present  many 
difficulties  because  the  available  sequential  codes  are  weak  and  expensive 
in  computer  time. 

In  all  four  of  these  areas  the  availability  of  parallel  processors 
promised  significant  improvements  and  in  each  that  promise  has  been 
achieved.  In  this  lecture  we  will  mainly  be  concerned  with  the  first 
class  of  problem  but  the  solution  of  sets  of  linear  equations  is  an 
essential  step  in  the  solution  of  all  four  classes  of  problem. 

2j _ Optimisation  Problems  and  Algorithms 

It  is  usual  to  find  on  analysing  the  solution  of  most  industrial 
optimisation  problems  that  at  least  95%  of  the  computer  time  is  spent  in 
evaluating  the  values  of  the  objective  function  F(x)  at  x  and  only  5%  of 

the  time  within  the  optimisation  code. 

It  is  therefore  natural  to  concentrate  on  the  speed  up  of  the 
calculation  of  the  engineering  model  F(x)  rather  than  the  code. 

This  can  effectively  be  done  in  two  distinct  ways.  If  we  assume  a 


computer  system  with  P  processors  that  can  act  in  parallel. 


The  calculation  of  each  objective  function  value  F(x)  is  divided  into  P 
parallel  tasks.  This  approach  leaves  the  responsibility  for  the  efficient 


use  of  parallelism  in  the  hands  of  the  user. 

Approach  B 

The  algorithm  is  modified  so  that  it  can  accept  P  values  of  F(x)  or 
VF(x)  computed  simultaneously.  This  places  the  responsibility  for  the  use 
of  parallelism  on  the  algorithm  designer. 

At  the  NOC  we  have  developed  codes  of  both  type  A  and  B.  The  finite 
element  optimisation  approach  to  the  solution  of  nonlinear  partial 
differential  equations  Singh  ( 1983 ) [ 1 1  naturally  leads  to  optimisation 
problems  that  divide  into  parallel  tasks.  This  has  been  implemented  with 
very  significant  speed  up  on  the  ICL-DAP  by  Ducksbury  ( 1985) 1 2 J .  Indeed 
any  partially  separable  objective  function  Toint  (1986)13]  can  be  solved  in 


this  way. 

Early  implementations  of  a  B  type  approach  in  which  gradient  and 
Hessian  information  were  approximated  by  difference  in  function  values  at 
different  points  computed  in  parallel  were  reported  by  Patel  ( 1984 ) [ A  J . 

With  the  availability  of  automatic  differentiation  Rail  ( 1981 ) [ 5 ]  which 
can  be  implemented  effectively  in  ADA  Mohseninia  ( 1 987 ) [ 6 J ,  codes  are  now 
being  written  to  utilise  concurrent  tasking  to  calculate  the  components  of 
the  gradient  (and  when  necessary  the  Hessian  matrix)  on  parallel 
processors.  In  this  context  Mohseninia  has  found  that  it  is  natural  and 
efficient  to  utilise  the  partially  separable  structure  when  computing  the 
function  and  gradient  values  as  parallel  (concurrent)  tasks. 

More  details  of  these  approaches  is  given  in  Dixon  &  Price  ( 1 986 ) ( 7 ] . 

If  we  assume  that  when  solving  class  1  optimisation  problems  that 
typically  9bX  of  the  computation  is  spent  calculating  function  and  gradient 
values  then  a  simple  calculation  demonstrates  that  to  obtain  significant 


speed  up  factors  the  remaining  5%  of  the  codes  must  also  utilise  the 
parallelism  available.  Assuming  perfect  efficiency  in  calculating  the 
function  values  in  parallel  on  P  processors  and  letting  t(P)  be  the  time  of 
computation  we  would  have 
f  -95 

x(P)  =  -  +  *05  T(l) 

l  P 

T(l) 

so  the  ratio  -  has  an  upper  bound  of  20  even  if  there  are  4000 

x(P) 

processors . 

It  is  therefore  essential  to  reduce  5%  of  the  time  spent  in  the 
optimisation  code.  In  most  optimisation  codes  this  calculation  is 
dominated  by  the  time  required  to  solve  a  set  of  equations 
Ax  =  b  x  e  Rn 

typically  in  unconstrained  optimisation  this  will  be  some  approximation  of 
the  Newton  equation 
V2Fx  =  -  VF 

while  for  REQP  methods  of  constrained  optimisation  A  and  b  also  contain 
constraint  information. 

Even  for  unconstrained  optimisation  the  most  apropriate  approximations 
A  and  b  are  dependent  on  both  n  and  P.  Byrd,  Schnabel  and  Shultz 
( 1 987 ) [ 8 ]  discuss  modifications  to  Approach  B  appropriate  when  n  +  1  <  P  < 
(n2  +  5n  +  2)/2.  Patel's  experience  was  obtained  with  P  >  (n2  +  5n  + 

2)/2;  whilst  Ducksbury  assumed  n  +  1  >  P.  However  in  this  lecture  we 
will  now  assume  A  and  b  have  been  obtained  appropriately  and  consider  the 
solution  of  the  set  of  equations 


1  WUtfUHmj  W  W  WTOCTPffTOOOTTOro  CTTTTO^CTTf  w  IV  is  W  1%  w.  w  v.r  V 


3.  The  Solution  of  a  Set  of  Linear  Equations:  Introductory  Comments 
The  problem  of  solving  a  set  of  equations 
Ax  =  b 

is  one  of  the  most  frequently  occurring  problems  in  numerical  computation. 


Because  of  the  frequency  with  which  it  occurs  it  has  been  intensely 
researched  and  there  are  very  many  variants  of  most  algorithms  designed  to 
solve  it.  It  is  probably  however  true  that  on  a  single  processor  machine 
most  numerical  analysts  would  prefer  to  use  one  of  four  standard  methods 
if,  as  will  be  assumed  in  this  lecture,  A  is  dense. 

The  four  broad  classes  of  method  we  will  consider  are: 

1.  Gaussian  Elimination 

2.  QU  Decomposition 

3.  Iteration 

4.  Conjugate  Direction  Techniques. 

Within  the  first  category  we  will  consider  four  variations: 

1.1  Choleski  Decomposition 


1.2  Total  Pivoting 

1.3  Partial  Pivoting 

1.4  Gauss-Jordan. 

Similarly  within  the  second  category  we  have: 

2.1  Householder's  Method 

2.2  Givens'  Method 

2.3  The  Gramm  Schmidt  Method 
and  in  the  third  category: 


3.1  S.O.R. 


3.2  Gauss-Seidel  Method 

3.3  Jacobi  Method. 

In  each  of  the  above  three  classes  of  method,  I  have  attempted  to  list 
the  variations  in  the  popular  order  for  a  sequential  machine,  i.e.  on  a 


I 

* *  ji 
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sequential  machine  I  suspect  most  people  would  use  Choleski  decomposition 
if  the  matrix  is  symmetric  and  positive  definite.  Whilst  on  a 
nonsymmetric  or  nonpositive  matrix  most  people  prefer  to  use  a  partial 
pivoting  code,  the  case  for  total  pivoting  on  really  badly  conditioned 
problems  is  accepted  but  disliked.  The  Gauss-Jordan  variant  of  either  is 
rejected  as  more  expensive. 

The  questions  we  shall  consider  in  this  lecture  are  which  algorithm  we 
should  use  in  RN  with  p  processors  and  secondly  how  does  that  answer  vary 
with  n  and  p  and  the  relative  compute  and  data  transfer  times  of  the 
system. 


4.  Gaussian  Elimination  Methods 

Most  Gaussian  codes  consist  of  three  parts 

(1)  triangularisation 

(2)  pivoting 

(3)  back  substitution 

in  which  the  triangulation  and  pivoting  interact  and  are  both  completed 
before  the  back  substitution  starts. 


4.1  Triangularisat ion 

Without  pivoting  triangularisation  takes  the  simple  form 


FOR  k  =  1  TO  N  -  1 


FOR  i  =  k  +  1  TO  N 


FOR  j  =  k  +  1  TO  N 


FI 


-  axkak3/akk 


NEXT  j,  i,  k. 


7 


Most  early  numerical  analysis  texts  commented  that  it  is  preferable  to 
rewrite  this  as: 

FOR  k  =  1  TO  N  -  1 
FOR  i  =  k  +  1  TO  N 

c  =  a. ./a.  .  F2 

i  k  k  k 

FOR  j  =  k  +  1  TO  N 


NEXT  j,  i,  k. 

This  change  was  recommended  because  on  most  sequential  machines 
t { 1  DIV  +  (N  -  k)  mults}  <  t{(n  -  k)Divs). 

Parkinson  observed  that  this  is  no  longer  true  on  either  a  pipeline  or  a 
DAP  processor.  On  a  pipeline 

t{l  Div  +  pipeline  mult}  >  t{chained  mult/divisions} 
whilst  on  the  DAP  the  original  code  is  3  simple  matrix  operations,  a 
property  lost  in  the  modified  form. 

For  those  using  Fortran  it  is  often  further  recommended  that  for  data 
access  reasons  it  is  preferable  to  do 
FOR  k  =  1  TO  N  -  1 
FOR  i  =  k  +  1  TO  N 


NEXT  i 

FOR  j  =  k  +  1  TO  N 
FOR  i  =  k  +  1  TO  N 


F3 


NEXT  i,  j,  k. 

While  these  changes  may  seem  trivial  such  considerations  can  significantly 
alter  the  efficiency  of  codes  and  their  effect  in  a  particular  architecture 


should  be  borne  in  mind. 


4.2  The  partial  pivoting  operation 

In  partial  pivoting  the  largest  element  of  n  -  k  elements  in  the  pivot 
row  is  determined  between  the  k  and  i  loops.  On  a  sequential  machine  this 
implies  n  -  k  comparisons.  On  a  parallel  machine  the  number  of 
comparisons  needed  to  compare  R  numbers  on  P  machines  depends  on  R  and  P. 

For  instance  if  P  =  4,  R  =  16  it  takes  5  steps 


Step  1  C(NX;  N2 ) ,  C(N3 ,  N4),  C(N5,  Ng  ) ,  C(N?,  N, ) 

2  c<Mi,2*  M3,4><  C(N,,  Nio>,  C(Ms>6,  NX1),  C(N7  8,  N12) 

3  C<M1.2,3,4’  C<N13>  N14>’  C(M5,6,ll’  N! 5 > *  C<M7,8,12’  N1 6 > 

^  ^(^1  ,  2  ,  ,  4  ,  9  ,  1  0  ’  ^  1  3  .  1  4  ^  ’  ~  *  ^^5,6,11,15’  ^7,8,12,16^’ 

3  ^^1,2,  3,  4,  9,  10,  13,  1  4  ^  ’  ^5,6,11,15,7,8,12,14^’  “  ’  ~  ’  ~  ' 

The  case  usually  quoted  assumes  p  =  R/2  when  the  comparison  requires  log2R 
steps  if  R  =  2k . 

4 . 3  The  partial  pivoting  algorithms 

On  a  sequential  machine  the  algorithm  requires  yn3  +  0(n2) 
multiplications  and  divisions,  and  0(n2)  comparisons.  The  times  for  the 
comparisons  are  usually  ignored. 

On  a  parallel  machine  with  P  =  (n  -  l)2  processors  Sameh  and  Kuck 
(1977)  showed  the  most  active  processor  only  needed  3(n  -  1) 
multiplications/  divisions,  but  0(n  log  n)  comparisons.  For  large  n  pivot 
comparisons  dominate!  Lord,  Kowalik  &  Kumar  (1983)  who  tested  many 
algorithms  on  the  HEP  (P  =  8)  computer  for  which  the  assumption 
P  =  (n  -  l)2  was  inappropriate,  re-examined  the  algorithm  assuming 
P  =  |n/2],  they  showed  that  the  algorithm  then  requires  n2  -  1 
multiplications  and  divisions  on  the  critical  path. 

Their  Speed  up  =  yn3/n2  -  1  =  jn  Efficiency  =  yn/y  =  |. 


The  Sameh/Kuck  and  Lord/Kowal ick  &  Kumar  methods  are  contrasted  below  for 


Sameh/Kuck  P  =  9 


Lord  Koval ik  &  Kumar  P  =2 


1. 

Select  Pivot  and  form  r,  = 

k 

1/ 

a  k  k 

1. 

The  same 

2. 

Calculate  c.  =  aikrk  (PAR) 

2. 

C  2  =  a2krk 

3. 

a  (2)  o  u  >  „  „  (1) 

a..  =  a  -  c  a.  . 

i  ]  i :  i  k  3 

(PAR) 

3. 

C  3  =  a3kCk 

4. 

Select  pivot  and  form  rk  = 

1/akk 

4. 

C4  =  a4krk 

5. 

Calculate  ci  =  aik  21  rk  (PAR) 

5. 

a  (2) 
a2  2 

a2  3 

(  2  ) 

6. 

_  (3)  <  2  )  (2) 

a..  =  a.  .  -  c. a.  . 

1  ]  i  3  1  k  3 

(PAR) 

6. 

a  ,2’ 

a3  2 

a3  3 

(  2  ) 

7. 

Select  pivot  and  form  rk  = 

1/akk 

7. 

a  <2) 

a4  2 

a4  3 

(  2  ) 

8. 

Calculate  ct  =  aik<3)  rk 

8. 

Pivot  and  rk 

a24 

t  2  ) 

9. 

a.  ,<4)  =  a.  ,<4)  -  c . (a  . 1 4 1 

i  3  i  2  1  '  k  ] 

9. 

C3 

a3  4 

(  2  ) 

10. 

C  4 

34  4 

(  3  ) 

Note.  L/K/K  could  be  improved 

at  the 

11. 

a  (3) 
a3  3 

a34 

(  3  ) 

beginning  but  this  would 

alter 

12. 

a  <3) 

a4  3 

a4  4 

(  3  ) 

their  calculation 

13. 

Pivot  and  rk 

14. 

C4 

15. 

(  4  ) 

3  4  4 

Amusing  variants  can  be  constructed,  for  instance  P  =  3 


1. 

Select 

pivot  and 

form  r.  = 

k 

1/akk 

2. 

C2 

C3 

C  4 

3. 

a  ,2) 

a  121 

a  (  2  ! 

a2  2 

a2  3 

3  2  4 

4. 

a  121 

a  (21 

a  (2! 

a3  2 

a33 

a34 

5. 

a  ,3) 

a  (  3  1 

a  1  3  ’ 

a4  2 

a4  3 

34  4 

6. 

Select 

pivot  and 

form  r.  = 

k 

1/akk 

7. 

C3 

C4 

8. 

a  (3’ 

3  3  3 

a  (3) 

a3  4 

9. 

a  ,3) 

a  (3> 

a4  3 

a4  4 

10. 

Select 

pivot  and 

form  r,  = 

k 

1/akk 

Comparing  the  four  codes  we  have 
P  =  1  C  =  23  C*P  =  23 

P  =  2  C  =  15  C*P  =  30 

P  =  3  C  =  12  C*P  =  36 

P  =  9  C  =  9  C*P  =  81 

so  our  speed  up  is  always  obtained  at  a  processor  *  time  cost. 

No  attention  has  been  given  so  far  to  data  transfer  times  i.e.  the  time 
necessary  to  transfer  the  data  to  the  processor  undertaking  the  compute. 

We  will  return  to  this  point  in  section  4.5. 

4.4  Total  pivoting 

In  total  pivoting  at  iteration  k  we  need  to  compare  (n  -  k)2  numbers. 
This  would  dominate  the  cost  even  more  for  large  n  and  small  P;  but  is 
ideal  on  the  ICL  DAP  where  there  is  a  special  fast  Max(Ai ^ )  instruction  for 
determining  the  maximum  element  of  a  matrix. 

4 .5  The  Gauss  Jordan  method 

Gaussian  elimination  is  usually  completed  by  back  substitution. 

Kuck  (1977)  has  shown  that  this  can  be  done  in  3(n  -  1)  steps  on  n  -  1 
processors.  However  it  should  never  be  done  on  such  a  system  since  the 
unused  processors  are  available  precisely  when  they  are  needed  to  do  the 
Gauss  Jordan  calculation.  If  we  insert  n  additional  rows  to  update  RHS  we 
now  only  need  one  additional  step  to  complete  the  solution. 

For  data  transfer  considerations  it  is  better  to  use  n  or  (n  +  1) 
processors  than  n  -  1  as  then  computations  involving  one  row  or  column  can 
be  done  on  a  particular  processor.  If  this  is  done  then  it  is  efficient 
to  update  the  pivot  row  in  parallel  at  each  step  by  making  the  pivot 
element  unity. 


Gauss  Jordan 


n  =  4 


P  =  4 


1. 

Select  pivot  and 

form  r.  = 

k 

!/akk 

2. 

ci 

C2 

C  3 

C  4 

3. 

a  121 

ai  2 

a  121 

a22 

a  (2) 

3  3  2 

a  (2) 

1  2 

4. 

a  (2) 

1  3 

a  (2) 

a2  3 

a  <2) 

a3  3 

a  (2) 

a4  3 

5. 

a,.”' 

a  121 

a3  4 

a44<2’ 

6. 

l  (  2  ) 

b3 

b4,2> 

7. 

Select  pivot  and 

form  r.  = 

k 

1/akk 

8. 

ci 

C2 

C  3 

C4 

9. 

(  3  ) 

3 1  3 

(  3  ) 

a23 

a  ,3) 
a3  3 

a,/’' 

10. 

a24<31 

a34<3’ 

a./’' 

11. 

b,'-' 

b2<3' 

b3<3' 

12. 

Select  pivot  and 

form  rR  = 

1/a,» 

13. 

c. 

C3 

c. 

14. 

a,.'*’ 

a  <4) 
a34 

a..'*' 

15. 

b/*' 

L  M  ) 
fa3 

b/<’ 

16. 

Form  r.  =  1/a.  . 

k  k  k 

17. 

C1 

C  2 

C3 

c. 

18. 

K  (  5  1  „  K  (  5  1 

bX  =  XX  b2  = 

X2  b3 

X3  b4  = 

It  is  probably  vorht  repeating  that  on  most  parallel  machines  it  will  be 
more  efficient  to  omit  the  separate  calculation  of  rk  and  compute  c.  by 
division  rather  than  multiplication.  If  the  pivot  calculation  is  ignored 


the  number  of  steps  is 
n  +  1 

E  k  -  1  -  ?(n  +  1 ) (n  +  2)  -  1. 
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5.  Orthogonal  AU  Decomposition  Techniques 


The  parallel  analysis  of  both  Householder's  and  Givens'  method  was 
given  by  Sameh  and  Ruck  (1977).  They  showed  that  given  P  =  0(n1 2 3 4 5 * 7) 
processors  Householder's  method  required  0(n  log  n)  operations  but  Givens' 
method  only  required  0(n).  This  contrasts  with  the  sequential  situations 
where  Householder's  method  is  the  more  efficient.  This  comparison  was 
tested  and  confirmed  by  Sorenson  (1985). 

The  Sequential  Givens  routine  consists  of 
1.  FOR  0  =  1  TO  N  -  1 


2.  FOR  P  =  Q  +  1  TO  N 


S 

=  aQQ//aPC  + 

FOR  J 

=  1  TO  N 

TEMP 

=  C)*apj  +  S  a( 

a  , 

=  -  S/ta0 ,  +  i 

Q.7 

P  J 

aPJ 

=  TEMP 

c  =  aPQ//aP52  +  a0Q2 


4.  Next  j  ,  P,  Q. 

Note  that  the  J  loop  is  an  ideal  pipeline  calculation.  Typically 
operation  3  is  considered  1  task. 

The  essential  feature  of  Givens'  method  is  that  a  pair  PQ  only  alters 
rows  P  and  Q  so  a  number  can  be  performed  in  parallel. 

The  earliest  discussion  of  a  parallel  Givens  code  was  that  due  to  Sameh 
and  Kuck  (77)  who  reduced  ApQ  to  0  in  the  following  order  for  an  n  x  n 
matrix 


1  * 

2  3  * 

3  4  5  * 

4  5  6  7  * 

5  6  7  8  9  * 

678  9  10  11  * 

7  8  9  10  11  12  13  * 


3  * 

2  5  * 

2  4  7  * 

1  3  6  8  * 

1  3  5  7  9  * 

1  2  4  6  8  10  * 

1  2  3  5  7  9  11  * 


Sameh  &  Kuck  (77) 


Modi  &  Clarke  (84) 


i'l Vl.i  l  >  >  j  UIVIv'ij 


»!« 


Their  code  performed  13  parallel  Givens  steps  on  4  processors.  In  general 
(2n  -  3)  parallel  Givens  steps  on  j  processors.  As  each  parallel  Givens 
step  involves  4n  +  4  operations  the  speed  up  is  dominated  by  4n3/3(2n  - 
3)(4n  +4);  so  S  =  |n  and  E  =  j.  Many  attempts  at  improving  the  Sameh  & 
Kuck  method  have  been  proposed  but  Cosnard  (1985)  has  shown  that  the 
"greedy"  algorithm,  Modi  and  Clarke  (1984),  is  optimal.  This  is  also 
shown  above,  and  only  requires  11  parallel  Givens  steps  on  4  processors. 
Cosnard  states  that  the  efficiency  of  this  method  is  asymptotically  |  but 
if  we  utilise  |  parallel  pipelines  when  performing  the  inner  loop  the 
algorithm  is  much  more  powerful  than  this  would  indicate. 

Kowalik,  Kumar  and  Kamgria  (1983)  noted  that  if  2(k  +  1)  processors  are 
used  to  perform  the  inner  loop  it  only  requires  4  steps.  Again  analysing 
the  nonoptimal  sort  they  concluded  that  with  P  *  jn2  +  |n  they  could  obtain 
a  speed  up  S  *  ln2/6  +  0(n),  C  «  8n  -  12.  Adding  the  3n  -  3  operations 
for  parallel  back  substitution  they  state  the  parallel  operation  count  as 
lln  -  5.  They  also  analysed  the  case  with  P  =  |(n-l)/2]  and  claimed 
C  =  6n2  +  8n  -  25  giving  S  =  |n  and  E  =  |  nil'  These  times  were  verified 
on  their  test  runs  on  the  HEP. 

6.  Iterative  Methods 
6.1  Jacobi's  method 

To  solve  Ax  =  b  first  scale  so  Du  =  1  then  iterate 

x(k  +  1 1  =  (I  -  A)x(k*  +  b. 

This  method  is  convergent  if  ||I  -  A||  <1  but  very  slow  when 
IH  -  AM  >  .9. 

The  method  is  however  ideal  for  parallel  processing  as  we  can  do  all 
the  rows  in  parallel  if  P  >  n  and  P  rows  in  parallel  if  P  <  n.  If  each 
processor  is  a  pipeline  it  is  even  better,  as  each  row  update  is  an  ideal 
pipeline  operation. 


4  *•  «  »  -  ■P«  w\  h 


It  is  therefore  possible  to  speed  up  the  very  poor  method 
significantly. 

6.2  Gauss-Seidel  method 

The  ordinary  Gauss-Seidel  algorithm 


(  k  +  1  )  (  k  ) 

X  =  X 


E  Ax, 

j-1  13  5 


-  .  (  k  I  , 

E  A  x .  +  b 

J-1  0  3 


is  not  suitable  for  parallel  computing  due  to  the  x..(k  +  11  term  on  RHS. 
This  can  however  be  adapted  for  a  block  Gauss-Seidel  approach  with 


P<n.  If  q  s.t.  Pq  <  i  <  P 


(  q  +  1  ) 


<  k  f  1  )  (  k  ) 

X  =  X 


r  a  I  k  t  1  !  _ 

E  A  x  -  E 


A.  x  .  1  k  1  +  b 


,q  +  l  *3  3 


allows  P  values  of  x  to  be  evaluated  in  parallel. 
6.3  Assynchronous  Jacobi  or  Gauss  Seidel  Method 


As  an  alternative  to  the  Block  Gauss  Seidel  method  Baudet  (1978)[17] 
proposed  the  assynchronous  algorithm  where 

n 

(  k  a  1  )  <  k  I  _  .  L  , 

x  =  x ,  -E  A  x+b 

1  1  .,133  1 

J=1 

is  used  where  xL  is  the  latest  value  in  the  store  when  it  is  accessed. 

3 

Allowing  P  =  n  processors,  1  for  each  i,  and  slight  variations  of 
performance  x^  could  be  either  x^  ,k)  or  x^(k  +  1>.  His  analysis  indicated 
there  would  be  no  loss  of  performance. 

7_. _  Conjugate  Gradient  Method 

The  sequential  code  consists  of  the  following  steps: 

1 .  Choose  x'  0 1  (usually  0) ;  e;  rQ  =  b  -  Axfl ,  d0  =  r0  while  |  |  r  |  |  >  e  do 


2,3,4. 


2.x  =  x  fad 

-I  -1  —1 


3.  r  =  b  -  Ax 


4.  d  -  r  +  0d 


a  =  rTd/dT Ad 


6  =  R*Tr*/rTr. 


A 


% 

‘v 

$ 

3 


Steps  2,3,4  are  ideal  for  a  parallel  processor. 


with  P  =  n,  the  calculation  of  Ax  and  Ad  requires  n  parallel  steps 


rTd,  dTAd,  rTr  requires  log  n  parallel  steps. 

So  one  iteration  costs  2n  +  3  log  n  +  4  steps. 

If  p  =  n2  we  can  do  all  the  multiplications  in  Ax  and  Ad  in  parallel  so  one 
iteration  is  only  3  log  n  +  6  steps. 

8.  Preconditioning 

In  practice  on  a  sequential  machine  both  Jacobi  and  Conjugate  Gradients 
are  preconditioned  before  use.  In  the  Jacobi  method  this  involves  the  use 
of  either  a  matrix  C  or  a  matrix  M  where  the  equation  becomes 
CAx  =  Cb  or  M  1  Ax  =  M_1b. 

For  efficiency  we  require  C  to  approximate  A  1,  or  M  to  approximate  A;  if 
we  are  using  M  we  need  Hz  =  v  to  be  much  easier  to  solve  than  Ax  =  b; 
while  if  we  are  using  C  we  need  C  to  have  a  nice  sparsity  pattern. 

The  preconditioned  Jacobi  method  is  then  either 
xk+1  =  (I  -  CA)xk  +  Cb  or  xk+1  *  xk  +  z 

where  Mz  =  b  -  Axk . 

In  the  conjugate  gradient  method  we  must  maintain  symmetry  so  we  use 
CACTy  =  Cb  where  x  -  CTy 

and  we  require  CACT  to  be  wellconditioned  and  C  to  have  a  nice  sparsity 
pattern. 

Whilst  the  method  could  be  programmed  in  terms  of  y  it  is  more  usual  to 
substitute  into  x  space  then  the  initial  step 
d,  =  AXj  =  CTC(Ax  -  b) 
rt  =  AXj  -  b;  z;  =  C(Axt  -  b) 
and  at  subsequent  iterations 
x*  =  x  +  ad 

r+  =  r  +  aAd 

z+  «  z  +  oCAd 

d*  =  Cz+  +  fid 


where  a  =  -  r  d/d  Ad  and  0  =  z  z  /z  z. 

The  usual  choices  of  conditioning  matrix  on  sequential  machines  are  the 
Neumann  or  incomplete  Choleski . 

The  Neumann  conditions  used  the  principle  if  G  =  (I  -  A)  then 
(I  G)  1  =  I  +  G  +  G2  +  G3  ...  and  truncate  this  series  for  C. 

On  sequential  machines  these  are  not  as  fast  as  the  incomplete  Choleski 
factorisation,  where  a  Choleski  factorisation  is  done  without  fill  in  to 
prserve  sparsity.  On  pipelines  and  parallel  computers  the  presence  of 
these  matrices  puts  up  the  CPU  time,  as  the  algorithms  are  no  longer 
readily  adapted  for  parallel  processing. 

A  typical  set  of  results  is  that  given  by  Blumenfield  (1983) 


Preconditioner 


I  +  G 


I  +  G  +  G 


I  +  G  +  G2  +  G3 


Incomplete  Choleski 


No.  of  Iterations  Pipeline  Time 


The  problem  of  preconditioners  for  parallel  and  pipeline  systems  needs 
further  research. 

9. _ Conclusions 

In  this  lecture  I  have  tried  to  illustrate  some  of  the  considerations 
that  must  be  taken  into  account  when  selecting  an  algorithm  for  use  on  a 
parallel  processor  architecture.  I  hope  to  have  shown  convincingly  that 
the  assumption  that  any  sequential  algorithm  can  be  selected  without 
careful  consideration  is  false. 

Of  the  Gaussian  type  methods  the  Gauss-Jordan  version  is  recommended, 
for  QR  decomposition  the  Greedy  parallel  Givens  method  is  the  obvious 
choice.  Foi  iteration  methods  the  Block  Gauss  Seidel  is  effective  and 


simple  to  program  as  is  the  conjugate  gradient  method.  Great  care  needs 
to  be  taken  when  introducing  preconditioning  matrices  into  parallel 
computation. 
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